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Data  aequatione  quotcunque  fluentes  quantitae  involvente 
fluxiones  invenire  et  vice  versa. 

Sir  Isaac  Newton,  1687. 

(It  is  useful  to  solve  differential  equations.)1 
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Chapter  1 


Introduction 


This  introduction  summarizes  the  motivation,  problem,  approach,  and  contributions 
of  this  thesis. 


1.1  Motivation 

High  performance  computing  is  in  a  transition  between  vector  supercomputing  and 
scalable  multicomputing.  The  software  strategies  which  are  effective  under  these  two 
paradigms  are  different.  Scalable  multicomputers  require  scalable  algorithms.  These 
are  algorithms  whose  elapsed  time  complexities  do  not  grow  “unreasonably  fast”  as 
the  problem  size  scales  with  the  computer  system. 


1.2  Problem 

The  efficiency  of  grid  based  computations  depends  on  the  load  balance  among  pro¬ 
cessors.  Scalable  grid  computations  require  scalable  load  balancing  methods.  These 
methods  must  compute  a  balanced  load  distribution  while  preserving  adjacency  rela¬ 
tionships  of  a  computational  domain. 


1.3  Approach 

Treat  the  computers  in  a  scalable  multicomputer  as  nodes  in  a  computational  grid. 
Use  finite  difference  techniques  to  solve  the  heat  equation  on  this  grid.  Adjust  the 
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CHAPTER  1.  INTRODUCTION 


workloads  among  computers  to  equal  the  evolving  solution.  Use  only  transfers  be¬ 
tween  pairs  of  adjacent  computers  for  this  adjustment. 


1.4  Contributions 

A  simple,  correct  and  scalable  load  balancing  method  for  grid  computations.  Correct¬ 
ness  of  the  balance  and  adjacency  properties  are  demonstrated.  Elapsed  execution 
time  diminishes  with  increasing  problem  size.  This  demonstrates  the  algorithm  is 
scalable  in  N  without  upper  bound  and  load  balance  can  be  maintained  at  negligible 
cost  for  large  grid  computations. 

The  algorithm  is  efficient.  Simulations  of  generic  problems  from  computational 
fluid  dynamics  predict  that  a  fraction  of  a  second  of  elapsed  time  is  sufficient  to 
rebalance  after  grid  adaptations.  Even  less  time  is  required  to  solve  a  static  load 
balancing  problem  for  a  million  point  grid  computation  on  512  computers. 

An  analysis  of  convergence  is  provided  for  the  static  load  balancing  problem. 
The  time  to  converge  the  dynamic  problem  is  usually  different  than  predicted  by 
this  static  analysis.  For  this  reason  simulations  are  recommended  to  provide  bounds 
for  specific  problems  instances.  The  fourth  chapter  presents  simulations  of  generic 
problem  instances  which  are  of  interest  in  computational  fluid  dynamics. 


Chapter  2 


The  load  balancing  problem 


The  “load  balancing”  problem  has  been  often  discussed  but  rarely  given  formal  defini¬ 
tion.  Load  balancing  is  necessary  in  order  to  achieve  effective  use  of  a  multicomputer. 
In  the  context  of  an  operating  system  it  is  necessary  to  keep  all  computers  busy  in  or¬ 
der  to  achieve  maximal  throughput.  A  mechanism  must  exist  to  ensure  that  tasks  are 
distributed  evenly  among  the  computers.  Otherwise  some  computers  will  be  under¬ 
utilized  and  overall  throughput  decreased.  In  the  context  of  algorithms  which  require 
synchronization  poorly  balanced  loads  will  result  in  some  computers  sitting  idle  while 
they  wait  for  more  heavily  loaded  computers  to  reach  synchronization  points.  This 
problem  affects  most  scientific  calculations  because  most  numerical  algorithms  require 
frequent  synchronization. 


Problem  2.1  (Load  Balancing):  Let  f 1  represent  a  set  of  elements  of  a  computa¬ 
tional  domain  and  ip  a  set  of  computers.  Let  (j>  :  0  — »  ip  and  <f>'  :  fi  — >  ip  represent 
assignments  of  elements  to  computers  and  u.  u'  the  measured  workloads  on  the  com¬ 
puters  under  the  assignments  (p1  <j>' .  The  load  balancing  problem  is  to  compute,  given 
an  initial  assignment  (p.  a  new  assignment  <f>'  and  vector  Aw  such  that 

(2.1. a)  u  +  Aw  =  v! . 

(2.1.b)  Ui  —  uj'  ¥i,j  G  if>. 

(2.1.c)  <t>'  preserves  adjacencies  of  fi. 
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The  load  balancing  problem  is  often  described  as  two  problems:  the  “static” 
problem,  in  which  the  goal  is  to  compute  an  initial  assignment  from  a  set  of  elements 
to  a  set  of  computers,  such  that  every  computer  has  the  same  workload;  and  the 
“dynamic”  problem,  in  which  the  goal  is  to  maintain  a  balanced  workload  under 
changing  conditions.  This  distinction  is  unecessary  since  the  static  problem  is  a  special 
case  of  the  dynamic  problem  in  which  the  initial  assignment  is  trivial.  The  simplest 
trivial  assignment  locates  all  elements  on  a  single  computer.  Further  discussion  of 
the  static  problem  will  assume  this  assignment. 

Call  conditions  (2.1. a),  (2.1.b)  the  “balance”  requirement  and  (2.1.c)  the  “adja¬ 
cency”  requirement.  The  adjacency  requirement  holds  that  two  elements  uj2  which 
are  adjacent  in  the  computational  domain  0  must  reside  on  either  the  same  computer 
or  on  a  pair  of  adjacent  computers.  The  workload  iq  on  a  computer  is  determined 
from  the  loads  of  the  individual  elements  uj  which  are  assigned  to  ^  as 

Ui  =  ^  size(u)  (2.1) 

{ui£Q\4>(iAj)=ipi} 


where  size(u;)  measures  the  workload  imposed  by  to  on  the  computer  ipi*  Taken 
in  isolation  this  requirement  permits  pathological  solutions,  such  as  one  where  all 
elements  are  assigned  to  a  single  computer.  When  this  requirement  is  combined  with 
the  balance  requirement  pathological  solutions  are  no  longer  permitted. 

The  adjacency  requirement  is  necessary  to  minimize  the  cost  of  communication  in 
domain  decomposed  calculations.  Under  the  assumption  that  most  communication 
occurs  among  adjacent  elements  of  a  computational  domain  an  assignment  (p  which 
preserves  the  adjacencies  in  0  minimizes  the  distances  messages  have  to  traverse  in 
the  multicomputer.  This  assumption  applies  to  many  problems  in  computational 
fluid  dynamics  and  finite  elements.  As  a  result  it  is  possible  for  these  calculations 
to  execute  on  multicomputers  without  contention  for  communication  channels  during 
a  complete  simultaneous  exchange  of  data  among  adjacent  elements.  This  type  of 
exchange  is  the  dominant  form  of  communication  for  many  scalable  algorithms. 

Bhokari  [10]  noted  that  the  load  balancing  problem  in  full  generality  is  NP- 
complete  by  transformation  from  the  graph  isomorphism  problem.  The  graph  iso¬ 
morphism  problem  decides  whether  two  arbitrary  graphs  are  the  same.  An  algorithm 
which  computes  solutions  to  problem  2.1  can  solve  the  graph  isomorphism  problem  if 
it  considers  the  adjacencies  of  the  problem  domain  to  define  one  graph  and  the  inter¬ 
connection  structure  of  the  computer  system  another.  If  the  interconnection  structure 
is  assumed  to  be  a  mesh  then  algorithms  which  solve  problem  2.1  cannot  solve  the 
graph  isomorphism  problem.  In  this  case  NP-completeness  of  2.1  still  follows  from 
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the  balance  requirement.  The  proof  is  by  transformation  from  the  partition  problem 

[26]: 


Given  a  finite  set  0  of  elements  u>  each  with  positive  integer  size(cj).  Does 
there  exist  a  subset  O'  C  fi  such  that 

y,  size(cj)  =  y  size(u;) 

This  proof  can  be  invalidated  by  making  a  further  assumption  that  elements  uj  have 
equal  weight.  Then  without  loss  of  generality  size(cj)  =  1  for  all  to.  Under  this 
assumption  polynomial  and  even  logarithmic  methods  exist  to  satisfy  the  balance 
requirement  (2.1. a),  (2.1.b).  The  problem  of  computing  Aw  or  u'  can  be  solved  by  a 
simple  tree  structured  algorithm.  The  algorithm  executes  in  a  sequential  number  of 
steps  which  is  logarithmic  in  the  size  of  the  computer  system.  In  order  to  be  useful 
any  load  balancing  method  should  be  as  simple  to  implement  as  this  algorithm  and 
should  require  no  more  elapsed  time. 

This  simple  algorithm  executes  on  every  computer  y  The  goal  of  the  algorithm 
is  to  identify  the  average  -s0  of  the  workloads  u,-  under  an  initial  assignment  < f> .  Com¬ 
puters  are  identified  with  nodes  of  a  tree  with  the  understanding  that  leaf  nodes  are 
their  own  children  and  the  root  it’s  own  parent. 

Algorithm  SIMPLE: 

•  Compute  local  workload  ut  from  expression  (2.1). 

•  Receive  sum  sj  and  count  Cj  from  every  child  tpj. 

•  Compute  ct  —  I  —  J2j  cj- 

•  Compute  Si  —  ( Ui  +  J2j  sjcj)/ci- 

•  Send  Si  and  c;  to  parent  if’h- 

Algorithm  SIMPLE  begins  execution  at  the  leaf  nodes  and  terminates  at  the  root. 
At  each  node  of  the  tree  a  computer  receives  the  average  workload  in  each  subtree 
below  it.  The  computer  uses  these  subtree  averages  to  compute  a  new  average  for  the 
subtree  of  which  it  is  the  root.  It  passes  this  new  average  to  it’s  parent. 

By  induction  it  is  easy  to  see  that  this  algorithm  terminates  in  a  state  where  s0  for 
computer  0  at  the  root  of  the  tree  is  equal  to  the  average  workload  among  all  of  the 
computers.  This  termination  condition  does  not  satisfy  the  balance  requirement.  In 
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order  to  satisfy  (2.1. a)  it  is  necessary  to  compute  a  vector  A u.  This  is  accomplished 
by  communicating  s0  through  the  tree  in  the  opposite  order,  from  the  root  to  the 
leaves. 

Algorithm  SIMPLE2: 

•  Receive  .s0  from  the  parent  computer. 

•  Send  So  to  every  child  computer. 

•  Compute  Aui  —  s0  —  ut. 

Algorithms  SIMPLE  and  SIMPLE2  contain  serial  dependencies.  These  serial 
dependencies  make  the  algorithms  inefficient.  They  are  inefficient  because  each  com¬ 
puter  is  idle  in  all  but  one  of  the  sequential  steps  of  execution.  It  would  be  nice  to 
find  a  concurrent  algorithm  to  compute  s0.  A  concurrent  algorithm  like  the  following 
one  is  potentially  more  efficient  because  no  computer  is  idle  in  any  step.  It  is  not 
difficult  to  imagine  this  algorithm  converges  to  something  like  s0.  It  has  even  been 
proposed  informally  as  a  load  balancing  method.  Unfortunately  it  would  not  a  make 
a  very  good  one. 

Algorithm  AVERAGE: 

•  Compute  local  workload  ut  from  expression  (2.1). 

•  Let  u;(°)  =  up 

•  For  to  =  1  to  r 

—  Send  u'\m  to  each  of  J  neighbors  ipp 
—  Receive  u'^m  1 1  from  every  neighbor  ?p:r 
—  Compute  u,-m'  =  (1  /  J)Y^j  u'\m  X'- 

•  EndFor  TO. 

•  Compute  Aui  —  u'\T^  —  u 

Algorithm  AVERAGE  is  a  concurrent  iteration  in  which  every  computer  receives 
the  workloads  u'^m  ^  at  it’s  neighbors,  computes  their  average  value,  and  then  ad¬ 
justs  it’s  own  workload  u'\m^  to  equal  that  average.  It  is  easy  to  be  persuaded  that 


9 


this  converges  to  a  steady  state  in  which  u'^  =  and  this  is  the  truth.  Un¬ 

fortunately  this  steady  state  rarely  satisfies  the  balance  requirement  of  problem  2.1 
1.  For  any  one  dimensional  problem  the  steady  state  reduces  to 

,(m)  _  J_  (  Am- 1)  ,(m  1)N 

^  i  2  \  ^  ^  i—di  J 

This  solution  can  be  expanded  in  Taylor  series  to  demonstrate  that 

/  ,  / 

^  2+cfo  "T"  ^  i—di 

+  ^|(<i.3/6)  +  O(rfi-) 

ax  aar 

f)y.f  ■  f)uf^ 

du'2 

2u't  +  ^ di 2  +  0(dz4) 

0  +  0(dz2)  (2.2) 

In  one  dimension  the  steady  solutions  of  algorithm  AVERAGE  are  second  order 
accurate  solutions  of  the  Laplace  equation  (2.2).  These  are  straight  lines  of  arbitrary 
slope.  Only  the  line  of  slope  zero  satisfies  (2.1.b). 

An  extension  of  this  argument  to  higher  dimensions  follows  immediately  from  the 
theory  of  finite  difference  equations. 

Although  algorithm  AVERAGE  is  unlikely  to  converge  to  solutions  of  problem 
2.1  there  are  reasons  to  like  it.  ft  is  concurrent,  and  scalable  in  the  sense  that  it  can 
execute  on  a  multicomputer  without  contention  for  communication  channels.  The 
converged  solutions  can  be  described  by  Taylor  expansions  since  all  transfers  occur 
between  pairs  of  adjacent  computers.  The  algorithm  also  holds  the  promise  of  com¬ 
puting  solutions  which  satisfy  the  adjacency  requirement.  Since  all  exchanges  occur 
between  adjacent  computers  it  should  be  possible  to  perform  these  exchanges  without 
destroying  existing  adjacencies.  Although  AVERAGE  is  not  a  correct  algorithm  for 
load  balancing  it  represents  an  approach  which  will  be  used  in  the  next  chapter  to 
derive  a  correct  and  scalable  load  balancing  method. 


2  u'i  = 


+ 


di1 


1  Specifically,  this  steady  state  satisfies  the  balance  requirement  exactly  when  the  boundary  con¬ 
ditions  are  periodic. 
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Chapter  3 


A  scalable  diffusion  method 


Algorithm  DIFFUSION: 

•  Compute  Ui  from  expression  (2.1). 

•  U'i  =  Ui 


For  /  =  1  to  r 


(o)  / 

-  v\  '  —  u'i 

—  For  m  —  1  to  v 

-  Send  v\m  to  all  neighbors  tpj 

-  Receive  from  neighbors  tpj 

( m )  (  A  \ 


-  l\ 

-  EndFor  m 

i  d 

—  U  i  —  V;  ’ 


(m- 1)  ,  d°- 

-J  vi  +  l+JA 


—  Send  u'i  to  all  neighbors  “tpj 

—  Receive  u'j  from  neighbors  tpj 


—  Transfer  A  (u'i 


units  of  work  to  each  neighbor  tpj 


•  EndFor  1 
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e  =  0.1 

64 

N  (total  processors) 
512  4,096  8,000  32K 

256K 

106 

A 

.36 

.21 

.16 

.16 

.16 

.16 

.16 

r 

4 

4 

4 

4 

4 

4 

4 

V 

7 

5 

4 

4 

3 

4 

4 

TIME 

21 

14 

13 

12 

12 

12 

12 

e  =  0.01 

64 

N  (total  processors) 
512  4,096  8,000  32K 

256K 

106 

A 

.36 

.21 

.16 

.16 

.16 

.16 

.16 

r 

7 

8 

8 

8 

8 

8 

7 

V 

13 

8 

7 

7 

7 

7 

7 

TIME 

84 

56 

49 

48 

47 

46 

45 

Table  3.1:  Values  for  constants  of  algorithm  DIFFUSION  for  given  N.  TIME  is 
defined  by  equation  (3.26)  and  is  proportional  to  the  elapsed  time. 


Heat  diffusion  is  a  process  in  nature  in  which  thermal  energy  diffuses  from  hot 
regions  into  cold  ones.  The  heat  equation  —  V2,t7  =  0  describes  this  process.  When 
taken  literally  the  terms  of  this  equation  read  that  the  rate  of  change  in  any 
element  ut  of  a  domain  u  depends  on  the  curvature  V2t7  in  the  vicinity  of  iq.  This 
locality  makes  the  heat  equation  a  good  model  for  a  scalable  load  balancing  method. 
Because  the  dependency  is  local  an  algorithm  based  on  the  heat  equation  requires 
only  local  exchanges  of  information  between  computers.  Because  heat  diffusion  is  a 
concurrent  process  it  is  a  good  model  for  a  concurrent  algorithm. 

Algorithm  DIFFUSION  is  a  concurrent  iteration  in  which  every  computer  de¬ 
rives  the  local  curvature  V2t7  and  then  adjusts  it’s  workload  according  to  a  finite 
difference  approximation  to  the  heat  equation.  J  is  the  number  of  computers  which 
are  immediately  adjacent  to  V’i-  The  constants  r,  v  and  A  are  obtained  from  table  3.1 
for  instances  of  the  static  problem.  For  instances  of  the  dynamic  problem  the  value 
of  r  presented  in  this  table  is  a  lower  bound.  In  these  instances  simulations  should 
be  used  to  predict  an  exact  value  for  r. 


3.1.  DERIVATION 
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3.1  Derivation 


Derive  the  algorithm  for  a  domain  corresponding  to  a  three  dimensional  mesh  con¬ 
nected  multicomputer.  Consider  the  heat  equation  in  three  dimensions 


ut  —  V2u 

-  UXX  +  Uyy  T  ^ZZ 


(3.i; 


Taylor  expand  u  to  obtain  each  successive  term  of  equation  (3.1).  The  first  term  ut 
is  obtained  by  expanding  in  t  with  all  derivatives  evaluated  at  (x,y,z,t). 


u(x.  y.  z.  t  —  dt)  —  u(x,y ,  z,t)  +  utdt  +  0(dt2) 

(u(x,y,z,t  +  dt)  -  u(x,y,z,t)' 


Ut 


dt 


+  0  (dt) 


Expanding  u  in  the  spatial  variables  reveals  the  curvature  terms  (where  omitted 
coordinates  are  interpreted  as  (x,y,z,t))  to  be 


u(x  +  dx,  •,  •,  ■)  =  «(■,  •,•,•)  +  uxdx  + 

dx 2  dx3 

-  -U  <u  - 

1  o 

u{x  —  dx,  ■,  ■,  ■ )  =  u(  —  uxdx  + 

dx2  dx3 

2  6 

u(x  +  dx,  ■,■,•)  +  u(x  —  dx,  )  =  2u(-, + uxxdx2  +  0{dx4) 


u,,. ...  ^  -f*  uxxx  £.  T  0 ( dx  ) 

+  0 (dx4) 


u., 


i(x  +  dx,  ■,  ■,  ■)  +  u(x  —  dx,  —  2 u{-,  ■,■,■) 
dx2 


+  0  (dx2 


Similar  expansions  in  y,z  show  that  the  heat  equation  can  be  rewritten 
u(x,y,z,t  +  dt)  -  u(x,y,z,t)  1 

- 77 - =  —7  (w(x  +  dx,  y,z,t)  +  u(x  -  dx,y,z,t)  + 

dt  dxz 

u(x,  y  +  dy,  z,t)  +  u{x,  y  -  dy,  z,t)  +  u{x,  y,z  +  dz,t)  + 
u(x,  y,z  —  dz,  t )  —  6u(x,  y,  z,  t )) 

Since  the  spatial  dimension  is  arbitrary  take  dx  =  1.  Identifying  A  with  the  time 
step  dt  and  taking  the  spatial  terms  on  the  right  at  time  t  +  A  yields  an  implicit  time 
stepping  scheme  with  unconditional  stability. 
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u(x.  y,  z,t)  —  { 1  +  6A)u(x,  y.z.  /  —  A)  —  A  ['u(x  +  dx ,  y.z.  t  —  A)  (3-2) 

+  u(x  -  dx ,  /y,  r.  /  -  A)  +  '«(x,  //  +  dy ,  -c.  /.  -  A)  +  u(x,  y  -  dy,  z,t  +  A) 

+w(x,  y,  #  +  <fa,  i  +  A)  +  u(x,  y,z  —  dz,  t  +  A)] 

In  order  to  compute  solutions  at  successive  time  intervals  A  it  is  necessary  to 
invert  the  relation  u ^  =  Auh+A)  by  solving 


,-(t+A)  =  (3.3) 

From  (3.2)  it  is  apparent  that  each  row  of  A  contains  a  diagonal  term  (1  +  6A) 
and  six  offdiagonals  —A.  Let  A  —  (D  —  T)  where  D  is  diagonal.  Then  the  relation 
t(h)  =  A(7h+A)  is  equivalent  to  r7h+A)  =  D~1Tu^t+A^  +  D_1  .  This  relation  is  satisfied 
by  fixed  points  of  the  Jacobi  iteration 

[-p+A)j  <m)  =  D-1T  ^(i+A)j  (m_1)  +  D~ (3.4) 

The  matrix  fa_1T  has  a  zero  diagonal  and  six  offdiagonal  terms  y+ga-  is  a 

diagonal  matrix  with  terms  1+16A.  Iteration  (3.4)  is  the  central  loop  of  algorithm 
DIFFUSION.  This  iteration  is  convergent  from  all  initial  conditions  and  has  spectral 
radius  defined  by  (3.23).  Since  the  iteration  is  concurrent  in  all  of  the  unknowns  it 
converges  at  the  same  rate  as  a  Gauss- Seidel  iteration. 


3.2  Correctness 

The  correctness  of  algorithm  DIFFUSION  is  implied  if  all  solutions  of  DIFFU¬ 
SION  satisfy  the  conditions  of  problem  2.1.  Recall  these  conditions: 

(2.1. a)  u  +  A u  =  u' . 

(2.1.b)  Ui  —  Uj  \/i,j  6 

(2.1.c)  (f)  preserves  adjacencies  of  fL 


3.2.  CORRECTNESS 
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3.2.1  Balance 

Recall  that  the  balance  requirement  is  defined  by  (2.1.a),(2.1.b).  Consider  the  finite 
difference  equation  (3.2)  rearranged  to  express  the  change  in  load  with  each  iteration 

u(x,  y,Z,t+  A)  -  u(x ,  y ,  z:t)  =  A  [u(x  -  \  .y.z.t-  A)  +  u(x  -  1,  y,  z,t  +  A) 

+u(x,  y  +  l,z,t  +  A)  +  u(x,y  -  l,z,t  +  A) 
+u(x,y:z  +  l,t  +  A)  +  n{x.  y.s  -  l,t  +  A) 

-6u(x,y,z,t  +  A)] 


or  as  a  vector  equation  with  matrix  operator  C 

u(t  +  A)  —  u(t )  =  A  Cu(t  +  A)  (3.5) 

Expand  the  solution  u(t)in  an  eigenvector  basis  x  of  C 


i,j,k 


and  rewrite  the  vector  equation  (3.5)  in  this  expansion 

^  1  +  A ^Xi^j^k  ^  )  £i,j,k  A  ^  )  T. ( /  T  A)  X  tj^k  (3.6) 

i,j,k  hjik 

For  the  purpose  of  analysis  assume  a  periodic  domain.  The  eigenvalues  Xijtk  of  C  are 


J 


k' 


\ij}k  —  2  3  —  cos27t - cos  27t - cos  27 r  — 


n 


n 


n 


(3.7) 


where  i,  j  and  k  range  from  0  to  n/2  —  1.  Use  the  knowledge  of  these  eigenvalues  and 

to  further  simplify  (3.6) 

^  )  ( Vji.j\k (l  T  A)xi-jtk  [f  +  AA^ij]  0-i .jrk ( t ) X'l  j^k )  0 

i,j,k 

and  by  the  completeness  and  orthonormality  of  the  eigenvectors 


^i,j,k{.t  T  A)  [1  T  AXi}j}k]  (H,j,k(t)  —  0 
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showing  that  the  time  dependent  decay  of  each  component  i.j.  k  is 

(3. 

An  arbitrary  component  k  decays  in  amplitude  by  a  factor  e  in  r  steps  if 

[1  +  AA;Ji,j  !  <  e  (3.9) 

Equation  (3.8)  shows  that  all  components  of  any  initial  disequilibrium  decay  in  am¬ 
plitude  to  zero.  Further  this  decay  occurs  at  exponential  rates,  ft  is  this  exponential 
property  that  underlies  the  d(logiV)  serial  time  complexity  of  the  algorithm.  The 
elapsed  time  complexity  (9(Al0g  fV)  is  discussed  later  in  this  chapter. 

Because  all  components  decay  to  zero  amplitude  the  calculation  of  u'  satisfies 
the  balance  requirement  (2.1.a),(2.1.b).  In  order  for  algorithm  DIFFUSION  to  be 
correct  it  is  also  necessary  that  the  transfer  of  work  cause  the  resulting  workloads 
to  equal  this  calculated  u' .  The  transfer  step  of  algorithm  DIFFUSION  sends 
A  (u'i  —  u'j)  to  each  neighbor  j .  On  a  three  dimensional  multicomputer  the  net  change 
at  each  computer  from  this  transfer  is 

A  (  (i  i  U  x—l,y,z) 

A  (u  j  U  i,z) 

A  (ll  j  U  x,y— l,z) 

A  (  ll  j  U  J:,y  ■  l.z ) 

A  (  U  i  U  x,y,z  —  1  ) 

A  (  il  j  U  x.y.z  ■  1  ) 

(1  +  6A)u(x,  ij.z.  /  -  A)  - 
A  [u(jr  +  dx ,  ij.  z.  I  —  A)  +  u(x  —  dx ,  ij.  z.  I  —  A) 

+u{x ,  y  +  dy ,  z,t  +  A)  +  u(x,  y  -  dy,  z,t  +  A) 

+  u(x.  y,z  +  dz,  t  +  A)  +  u(x.  y,z  —  dz,  t  +  A)] 

This  is  equation  (3.2). 


(A) 


1 


(0) 


1  +  AA 


*,iA 


3.2.2  Adjacency 

Recall  that  condition  (2.1.c)  is  the  adjacency  requirement.  If  an  initial  assignment  <f> 
preserves  adjacencies  of  a  domain  f l  then  the  new  assignment  (j)'  must  preserve  the 
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same  adjacencies.  This  means  that  every  pair  of  elements  u>i,u>2  which  are  adjacent 
in  fl  reside  on  either  the  same  computer  or  on  adjacent  computers. 

In  order  to  demonstrate  that  solutions  of  algorithm  DIFFUSION  satisfy  the 
adjacency  requirement  it  suffices  to  show  that  algorithm  DIFFUSION  preserves 
adjacencies  which  are  present  in  an  initial  assignment  <f>.  This  can  be  demonstrated 
by  an  informal  argument. 

•  <j>'  preserves  adjacencies  of  fi  if  <f>  preserves  adjacencies  of  fi  and  algorithm 
DIFFUSION  does  not  destroy  adjacencies. 

•  Algorithm  DIFFUSION  does  not  destroy  adjacencies  if  it  does  not  destroy 
them  in  any  transfer. 

•  Algorithm  DIFFUSION  need  not  destroy  adjacencies  in  any  transfer  because 
all  transfers  occur  between  adjacent  computers. 

This  argument  does  not  suggest  that  adjacencies  are  preserved  by  all  possible 
transfer  mechanisms.  Instead  it  suggests  that  transfer  mechanisms  exist  which  pre¬ 
serve  adjacencies.  To  make  this  point  concrete  consider  the  example  of  a  domain 
decomposed  grid  computation  on  a  three  dimensional  multicomputer.  Each  com¬ 
puter  solves  a  portion  of  a  three  dimensional  domain.  A  transfer  mechanism  which 
preserves  adjacencies  selects  for  transfer  those  elements  u  which  are  nearest  in  fi  to 
elements  on  adjacent  computers.  Such  a  mechanism  can  be  implemented  with  index¬ 
ing  to  have  a  fixed  cost  per  transaction.  Coarse  grained  versions  of  this  approach  have 
been  shown  effective  in  molecular  dynamics  and  vortex  calculations  [3,  7,  9].  In  some 
applications  it  may  not  be  practical  to  transfer  work  in  small  quantities.  In  these 
cases  it  may  be  more  efficient  to  postpone  transfers  until  the  algorithm  has  converged 
on  a  value  for  Aw. 


3.3  Static  problem  analysis 

Recall  that  the  static  load  balancing  problem  is  a  special  case  of  problem  2.1.  In  this 
special  case  the  initial  assignment  (p  maps  the  entire  domain  fl  to  a  single  computer  '</>;. 
Algorithm  DIFFUSION  solves  instances  of  the  static  problem  by  diffusing  domain 
elements  from  ipi  until  the  balance  requirement  is  met. 

It  is  possible  to  analyze  the  convergence  to  equilibrium  of  instances  of  the  static 
problem.  The  initial  assignment  <f>  corresponds  to  an  instance  of  a  unit  impulse.  A 
unit  impulse  is  a  summation  of  equally  weighted  sinusoids.  Recall  from  equation  (3.9) 
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that  to  reduce  the  amplitude  of  an  arbitrary  component  i,j,  k  by  e  in  r  steps  of  the 
method  requires  [1  +  AA ijrk]~T  <  e.  The  fastest  case  occurs  for  the  smallest  positive 
eigenvalue  A0,o,i  =  (2  —  2cos(27r^-))  which  corresponds  to  a  high  frequency  sinusoid. 
To  reduce  such  a  disturbance  requires 


In  e-1 


In  ( 1  —  A  ^2  —  2  cos  2i r^- 

Convergence  of  this  component  approaches  lne-1  for  large  n  since 

T 


(3,10) 


lim  In  1  +  A  2  —  2  cos  2tt 


n 


=  1 


Convergence  of  lowest  wavenumber  component  A„/2-i,n/2-i,n/2-i  is  slow  because 


lne-1 

In  (l  +  A(6  -  6  cos  tt^|5|)) 
This  converges  to  oo  for  large  n  because 


(3.11) 


lim  ln(l+A|6  —  6  cos  k—. - —  ]  |  =  0 

By  understanding  the  decay  of  individual  components  it  is  possible  to  analyze  the 
decay  of  a  summation  of  equally  weighted  components  over  time.  The  following 
text  uses  the  Poisson  bracket  (•,•)  to  represent  the  inner  product  operator.  When 
discussing  loads  or  eigenvectors  it  uses  u[x,y,z\  or  |y,':  tj.  z]  to  denote  the  vector 
element  which  corresponds  to  location  x,y,z  of  the  computational  grid  with  the 
convention  that  [0,  0,  0]  is  the  first  element  of  the  vector.  Then  the  initial  disturbance 
confined  to  a  particular  processor  x,  y,  z  can  be  written  as  an  eigenvector  expansion 


u(0)  A  '  O  (  0  )  X  l  ,n .  , . 

l,m,n 


(3.12) 


Since  by  assumption  the  instance  is  a  unit  impulse  the  initial  disturbance  u(0)  is  zero 
except  at  element  [x,y,^].  Then 


(xij}k,u(  0))  =  Xij}k[x,y,z\ 


(3.13) 
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This  is  equal  to  the  initial  amplitude  ciij,k(  0)  of  each  eigenvector  Xij,k 

^(0)) 


In  order  to  facilitate  analysis  it  has  been  assumed  that  the  computational  domain 
has  periodic  boundary  conditions.  Because  of  this  assumption  the  origin  of  the  coor¬ 
dinate  system  is  arbitrary.  Without  loss  of  generality  place  the  origin  at  the  source 
of  the  disturbance  and  take  x  —  y  —  z  —  0.  This  has  no  effect  on  the  eigenvectors 
Xij,k  and  from  (3.13),  (3.14) 

oy,*(0)  =  [0,0,0]  (3.15) 

Placing  the  origin  at  the  source  of  the  disturbance  is  convenient  in  considering  the  hrst 
element  of  the  eigenvectors  £;jy-[0,  0,  0].  C  has  n/2  distinct  eigenvalues  Xij:k  each  of 
algebraic  multiplicity  two.  Each  of  these  eigenvalues  has  geometric  multiplicity  eight, 
ie.  has  eight  linearly  independent  associated  eigenvectors  of  unit  length 

X a- U',  y,  z]  =  Ci,j,kF1  (27 rxi/n)  F2  (2tt yjjn)  F3  (2tt zk/n )  (3.16) 

where  each  F.t  is  either  sin  or  cos.  By  choosing  x  —  y  =  z  —  0  this  expression  is  zero 
except  for  the  single  eigenvector  for  which  jFj(i)  =  F2(x)  —  F3(x)  =  cos(x).  Without 
loss  of  generality  restrict  consideration  to  initial  disturbances  of  the  form 

«[o,  o,  01(0)  =  E  o,  o,o]  =  E  4, .a  (3.17) 

i,j,k 

From  (3.8)  dehne  the  time  dependent  disturbance  at  any  location  x‘,  y1,  z1 

u\x  ^  y  ,  z  ](r A)  h  )  y.  [1  T  AAj  j-y.]  x.  [.r  ,  y  ,  z  ] 

i,j,k 

=  E  ch,k  t1  +  AA,,*r  cos  27r  — 

n 

i  :hk 

y'j  o  z'k 

cos  27 r —  cos  27 r - 

n  n 


E  ai  ,m,n  (0  )xi  ,m^n  y 

\  l,m,n  l 

E  ,m,n  )  ai  ,m,n  (0) 

/,m,n 

E  a‘  ,m,n  (  0  )  &U  & jm  ^ kn 

l,m,n 

ai,j ,fc(0) 


(3,14) 


(3.18) 
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From  (3.16)  a  necessary  condition  for  a  normalized  eigenvector  i 


is 


1  =  cLk  E  cos2  (  27rT7  )  cos2  (  27rT- )  cos2  (  27r 

x,y,z  \  ft  /  \  It 

l 


zk 

n 


x,y,z 


' j kg  E  ^l  +  cos47r—  J  ^1  +  cos4tt— J  (l+cos47r — - 


x,y,z 


=  ch,kg  E  \{l  + cos  47r— J  1  + cos  47r^ 


+  22  cos  |  4-7T  —  j  '22  COS  (  47r  —  )  1  +  COS  4-7T 


JM 


_yj 

n 


zk ' 


Simplify  the  preceding  expression  by  the  following 

Lemma  1 


E, 

COS  47T —  =  0 

n 


Proof: 


22  cos  4?r —  =  y,Re  (t 

Z — J  T)  L '  V 

x  11  X 

=  &E 


lAnxi/n 


c lAnxifr 


=  Re  22  (t^‘/nY 

X 

^lAni/n  ^  _  ^i47ri/n^j1 


—  Re 

=  0 


Q.E.D. 

Repeated  application  of  lemma  (1)  to  equation  (3.19)  yields 

1  =  ^4£(i+cos4v)(i+cos4if) 

2  i 

~~  Chj,k  g 


zk 


n 


22  (  1  +  cos  47T —  +  E  (cos47t—  |  1  +  cos'4sr 


zk 


n 


(3.19) 
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=  c: 


1 

i,j,k  g 

1 


Y  i  +  Y  ( cos  4?r 


\_x,y,z  XiViZ 


zk ' 


n 


=  ci,j,k  g« 

From  which  one  concludes 


(3.20) 


8  A 1/2 


-  (l\ 


n 


V  *,  j,  k 


From  (3. 17), (3. 18)  and  (3.3)  the  time  dependent  disturbance  at  0,0,0  is  therefore 


8 


([0,  0,  0](rA)  —  —  Y 


n 


i,j,k 


(  i  j  k 

1  +  A2  (  3  —  cos  2tt - cos  2tt - cos  27r  — 

\  n  n  n 


(3.21) 


From  (3.21)  it  follows  that  u[0,0,0](tA)  <  e  when 


-E 

n  rr. 


(  i  j  k N 

1  +  A2  (  3  —  cos  2-7T - cos  2tt - cos  2tt  — 

\  n  n  n  , 


<  e 


and  therefore 


In  e-1 

ln(8/n)  1  +  A2  ^3  —  cos  2ir-^  —  cos  2ir^  —  cos  2ir^ 


(3.22) 


3.4  Scaling 

Equation  (3.22)  is  an  expression  for  r.  A  similar  expression  for  v  can  be  obtained  by 
considering  the  convergence  of  the  Jacobi  iteration.  From  the  Gersgorin  disc  theorem 
[24]  the  eigenvalues  A  of  the  Jacobi  iteration  are  bounded  |A|  <  tA|x.  Since  the  row 
and  column  sums  are  constant  and  the  iteration  matrix  is  nonnegative  ([24],  theorem 
8.1.22)  the  spectral  radius  equals  the  row  sum 

A^lr)  =  TTiA  |3'23) 

Define  the  error  in  a  current  value  w('"J  under  the  iteration  as  e(u  M)  =  («&*■)  -  u *) 
where  u*  is  the  fixed  point  of  the  iteration.  Then  for  v  >  0 


e(u^)  =  e((D~1T)l/u^)  =  (D~lTfe(u^) 


(3.24) 
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which  converges  when  p(D  1T)  <  1  since  p((D  1T)V) 
quantify  the  error  define  the  infinity  norm 


(p(D  1T)Y .  In  order  to 


|e(; 


1L 


max 

x,y,z 


e(; 


(m)  ' 

x,y,z* 


max 

x  ,y,z 


u (m)  —  u* 

x,y,z  x,y,z 


Using  this  norm  define  a  necessary  condition  to  improve  the  accuracy  of  the  solution 
u  by  a  factor  e  in  v  steps  to  be  ||e(u(1'))||00  <  e 1 1 e( 1 1 oo -  This  is  satisfied  when 
{p{D~1T)Y  <  e  and  thus 


In  i 


In 


6  A 

1+6  A 


(3.25) 


The  elapsed  time  complexity  of  algorithm  DIFFUSION  is  proportional  to  the 
product  tv.  Table  3.1  provides  values  for  the  parameters  r,  v.  A  and  e  which  minimize 
this  quantity.  Equation  (3.26)  demonstrates  that  tv  is  logarithmic  in  N. 


In 

.  e  In  e  1 

111  1+6A  ln(8/”)  XU* 

1  +  A2  | 

(3  —  COS  27T-  —  COS  271%  —  COS  27T-) 

n  n  n ) 

I 

(3.26) 


3.5  Remarks 

This  chapter  has  presented  a  simple  algorithm  which  satisfies  the  balance  and  ad¬ 
jacency  requirements  and  which  executes  in  decreasing  elapsed  times  for  increasing 
problem  sizes.  The  algorithm  causes  all  components  of  an  initial  disturbance  to  decay 
in  amplitude  to  zero  at  exponential  rates.  Convergence  of  any  disturbance  is  bounded 
by  the  decay  of  the  slowest  component  (3.11).  Although  it  is  possible  to  analyze  the 
convergence  of  the  static  problem  exactly  it  is  more  difficult  to  predict  convergence 
of  multiple  point  disturbances.  For  this  reason  the  next  chapter  presents  simulations 
of  multiple  point  disturbances. 

Algorithm  DIFFUSION  is  formulated  in  such  a  way  that  it  reduces  the  worst 
case  imbalance  between  any  two  processors  Yi,  V’j  by  a  factor  e.  In  a  practical  context 
it  can  be  useful  to  reduce  the  imbalance  below  a  threshold  which  is  a  percentage  of 
the  total  load.  For  example,  it  can  be  useful  to  assert  that  the  worst  case  imbalance 
will  be  within  10%  of  the  load  average  for  a  given  problem. 

To  use  the  algorithm  in  this  way  formulate  e  as  a  function  of  N  and  of  |0|,  the 
total  number  of  domain  elements.  Define  a  threshold  a  of  eg.  10%  and  assert  that  the 
solution  will  have  a  worst  case  imbalance  of  no  more  than  10%  of  the  load  average. 
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If  the  initial  imbalance  is  M  (  as  in  the  static  problem)  then  t  =  jj.  An  optimal  A 
can  be  found  for  any  e  by  minimizing  tv. 

Consider  the  example  of  a  computation  with  10'  grid  points  on  100  computers. 
The  average  load  is  107/102  =  10s.  If  a  —  10%  then  the  desired  worst  case  imbalance 
is  104.  Solving  e|fl|  =  a|fi|/IV  gives  elO7  =10-110710-2  e  =  10-3. 
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Chapter  4 


Simulations 


This  chapter  presents  simulations  of  three  generic  problems  of  interest  in  computa¬ 
tional  fluid  dynamics  and  of  a  problem  with  active  source  terms.  The  simulations 
which  follow  were  executed  with  a  fixed  time  step  A  =  0.1,  v  =  3  and  e  =  0.1.  The 
results  of  these  simulations  demonstrate  that  r  is  small  for  realistic  problems  of  inter¬ 
est  and  elapsed  time  is  a  fraction  of  a  second  for  all  instances.  As  table  3.1  suggests 
lower  times  can  be  achieved  by  permitting  A,  v  and  e  to  vary  with  N .  All  timings  are 
for  a  J-machine1  [34]  with  33  Mffz  processors. 

Spectral  bisection  [38]  is  a  popular  method  to  solve  instances  of  the  static  prob¬ 
lem.  This  method  can  require  considerable  cpu  time  for  large  problems.  Algorithm 
DIFFUSION  solved  an  instance  with  one  million  unknowns  and  512  computers  [41] 
in  a  few  hundred  milliseconds. 

Solution  methods  for  problems  in  fluid  dynamics  and  structural  analysis  often 
increase  the  density  of  a  computational  grid  in  response  to  properties  of  the  solution. 
A  bow  shock  resulting  from  the  preceding  calculation  and  a  generic  launch  sequence 
with  a  moving  boundary  were  simulated  for  a  problem  with  one  billion  unknowns 
on  one  million  computers.  The  grid  density  was  increased  100%  in  regions  of  the 
shock  and  400%  at  the  moving  boundary.  Both  of  these  disturbances  were  removed 
by  algorithm  DIFFUSION  in  a  few  hundred  milliseconds. 

The  algorithm  is  formulated  under  the  assumption  that  no  new  work  is  created 
during  the  time  the  algorithm  is  executing.  A  final  simulation  shows  that  the  algo¬ 
rithm  is  robust  even  in  the  presence  of  large  and  frequent  injections  of  new  work. 
Algorithm  DIFFUSION  kept  the  disturbance  below  the  magnitude  of  the  average 
injection  for  1000  iterations.  The  disturbance  quickly  dissipated  after  the  injection 

^he  J-machine  is  an  experimental  design  built  at  MIT. 
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68.75  ms 


103.125  ms  137.5  ms  171.875  ms 


206.25  ms  240.625  ms  275  ms 

Figure  4.1:  Static  load  balancing  problem  for  a  grid  of  one  million  points  on  a  system 
of  512  computers.  The  initial  imbalance  was  reduced  by  90%  in  6  steps.  Condition 
(2.1.b)  was  satisfied  to  within  10%  of  the  load  average  in  162  steps. 
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206.25  ms  240.625  ms  275  ms 

Figure  4.2:  Dynamic  load  balancing  problem  following  /i- refinement  adaptation  for 
a  grid  of  one  billion  points  on  a  system  of  one  million  computers.  Condition  (2.1.b) 
was  satisfied  to  within  10%  of  the  load  average  in  170  steps. 
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206.25  ms  240.625  ms  275  ms 

Figure  4.3:  Dynamic  load  balancing  problem  following  moving  boundary  adaptation 
for  a  grid  of  one  billion  points  on  a  system  of  one  million  computers.  Condition  (2.1.b) 
was  satisfied  to  within  10%  of  the  load  average  in  50  steps. 
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515.625  ms  1031.25  ms  1546.875  ms 


2062.5  ms  2578.125  ms  3093.75  ms 


3609.375  ms  4125  ms  4640.625  ms 

Figure  4.4:  Problem  with  active  sources.  Average  activity  per  iteration  is  30,000% 
of  the  initial  load  average.  Activity  ceased  at  step  1000.  Error  in  condition  (2.1.b) 
decreased  by  three  orders  of  magnitude  in  500  additional  iterations. 


Chapter  5 

Scalable  load  balancing  methods 


The  load  balancing  problem  has  been  the  subject  of  considerable  discussion  in  the 
past  decade  [2,  3,  4,  5,  6,  7,  9,  10,  11,  12,  13,  14,  15,  16,  19,  20,  22,  23,  25,  28, 
29,  30,  31,  32,  33,  37,  38,  39,  42,  43].  This  discussion  has  suffered  from  a  lack  of 
consensus  regarding  the  architectures  to  be  considered  and  the  constraints  under 
which  the  problem  shoidd  be  solved.  An  unfortunate  consequence  of  this  fact  is 
that  the  discussion  has  little  continuity  and  new  methods  rarely  build  on  previous 
work.  Most  methods  rely  on  simulation  to  demonstrate  their  effectiveness  and  rarely 
have  formal  proofs  of  correctness.  As  a  result  it  can  be  difficult  to  extrapolate  their 
behavior  to  architectures  other  than  those  on  which  they  were  simulated.  Very  few 
methods  consider  the  issues  of  scalability  and  adjacency.  Some  noteable  approaches 
among  this  group  of  methods  are  gradient  based  models  [29],  bidding  algorithms  [31], 
a  drafting  algorithm  [32]  and  a  method  based  on  queing  theory  [11],  Most  recently 
a  spectral  bisection  method  has  become  popular  for  problems  which  involve  complex 
geometries  [38]. 

5.1  Diffusion 

The  first  published  work  on  diffusion  methods  is  due  to  Cybenko  [14].  This  work 
proposes  a  method  to  compute  a  balanced  distribution  of  work  in  an  arbitrarily 
connected  graph.  The  method  assigns  a  weighting  factor  cqj  to  every  edge  i,j  of  a 
graph  where  each  vertex  i  is  associated  with  a  workload  w,t.  The  method  computes 
an  iteration  at  each  vertex  i  concurrently 

w]  <-  w]  +  ^  a,  (yi)  -  w\  J  +  III 

j 
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In  this  model  the  a  are  constrained  so  that  ai,j  =  1-  Let  M  denote  the  matrix 
where  AltJ  —  ahj  for  every  edge  i.j  in  the  graph  and  MUJ  =  0  if  there  is  no  edge 
between  vertices  i  and  j .  //  represents  a  stochastic  source  of  new  work  arising  during 
the  rebalancing  process.  Then  when  rj  —  0  the  iteration  can  be  expressed  as  a  matrix 
equation 

u5(*+1)  <-  Mw^  (5.1) 

The  article  demonstrates  convergence  of  (5.1)  and  hence  correctness  of  the  algo¬ 
rithm  under  stationary  conditions.1  The  constraint  on  a  requires  that  AI  has  row  and 
column  sums  ecpial  to  1  and  hence  the  spectral  radius  is  bounded  p(M )  <  1  ([24]  thm. 
8.1.22).  Since  AI  is  nonnegative  and  irreducible  the  Frobenius  generalization  implies 
that  AI  has  a  single  eigenvalue  of  modulus  1  and  that  the  corresponding  eigenvectors 
are  uniform  distributions  in  w.  AI  can  have  an  eigenvalue  of  -1  only  if  it  can  be 
partitioned  into  block  form 

[0  A  ' 

A*  0 

which  is  a  condition  for  the  interconnection  graph  to  be  bipartite.  Therefore  the 
algorithm  does  not  oscillate  so  long  as  the  interconnection  graph  is  not  bipartite. 

This  work  does  not  explicitly  consider  the  rate  of  convergence  but  suggests  an 
acceleration  technique  to  minimize  p(M ) 

AI  T  kI 

Mk  —  — — ■ - ,  k  =  —  (A„  +  A2)/2 

i  T  K 

Adding  kI  to  a  diagonal  AI  shifts  the  eigenvalues  A;  of  M  to  A;  +  k.  The  division  by 
I  +  k  normalizes  the  row  and  column  sums  of  M  to  1  which  is  necessary  in  order  for 
the  proof  of  convergence  to  hold. 

The  work  considers  the  problem  of  active  sources  rj  >  0  and  demonstrates  that  if 
a  is  the  variance  in  an  initial  distribution  and  n  the  number  of  processors  then  the 
asymptotic  expected  value  of  the  variance  is 

w(t 

The  article  concludes  by  considering  a  dimension  exchange  algorithm  for  binary 
hypercubes  in  which  all  cqj  =  1  jd  where  d  is  the  dimensionality  of  the  hypercube. 

■^he  proof  is  by  Frobenius’s  generalization  for  nonnegative  irreducible  matrices  of  Perron’s  the¬ 
orem  ([24]  thm.  8.2.11).  It  proves  boundedness  of  the  spectral  radius  and  gives  conditions  for  a  -1 
eigenvalue. 
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This  is  a  recursive  algorithm  in  which  each  dimension  in  turn  balances  the  work 
among  it’s  set  of  computers.  The  algorithm  terminates  after  d  serial  steps  and  scales 
with  elapsed  time  0(logd  N). 

This  article  has  been  a  precursor  of  several  subsequent  scalable  and  correct  load 
balancing  methods  for  distributed  memory  computers  [6,  13,  21,  25,  43].  For  example, 
implementations  of  this  algorithm  can  bear  a  strong  resemblance  to  algorithm  DIF¬ 
FUSION.  In  the  absence  of  source  terms  (r/  =  0)  and  under  the  assumption  that  the 
graph  represents  the  interconnection  structure  of  a  three  dimensional  mesh  connected 
multicomputer  matrix  M  in  (5.1)  is  equivalent  to  the  expression  AC  +  /  where  L  is 
taken  from  (3.5).  Cybenko’s  method  also  bears  a  strong  resemblance  to  a  proposal 
by  Boillat  [6].  This  method  is  also  formulated  for  an  arbitrary  interconnection  graph. 
When  applied  to  a  three  dimensional  interconnection  mesh  for  a  multicomputer  the 
algorithm  is 

fb+i)  =  (5.2) 

where  PG  is  a  matrix  with  the  same  sparsity  structure  as  L  in  equation  (3.5)  but 
where  every  nonzero  term  is  1/7.  While  this  does  not  correspond  directly  to  a  finite 
difference  expression  it  is  obviously  convergent  by  the  same  arguments  presented  in 
[14].  The  article  uses  Markov  techniques  to  demonstrate  that  the  iteration  (5.2)  is 
acyclic  and  converges  to  equilibrium.  It  considers  the  rate  of  convergence  and  derives 
an  upper  bound  for  eigenvalues.  On  a  three  dimensional  mesh  this  upper  bound  is 

A  <  1 - sm2  — 

7  2  n 

Using  this  bound  it  demonstrates  that  worst  case  convergence  has  an  upper  bound 
elapsed  time  complexity  of  0{n2).  It  is  unfortunate  that  this  method  has  received 
little  attention  as  it  is  one  of  the  few  which  has  been  implemented  in  the  context 
of  real  scientific  calculations  [7,  9].  This  may  be  partly  the  result  of  an  obscurity  of 
notation  and  presentation. 


5.2  Transfer  function 

A  novel  formulation  of  the  problem  of  calculating  Vt7  is  introduced  by  Conley  [13]. 
The  formulation,  which  arose  in  part  from  discussions  of  the  load  balancing  problem 
with  this  author,  requires  solution  of  an  elliptic  equation 

V2f  =  -Vu 


(5.3) 
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where  the  resulting  transfer  of  loads  is  effected  by 

A  u  =  V  f 


(6.4) 


If  uj  represents  the  average  workload  then  correctness  of  (5. 3), (5. 4)  follow  from 


V  f 
V  (v  ■  T ) 

-V  x  (V  x  f)  +  V2f 

v2f 


U(,  —  u 

Vuf,  —  Vw 


— V« 


-Vw 


(5.5) 


The  hnal  step  of  (5.5)  assumes  an  irrotational  constraint  V  X  T  =  0.  Irrotational 
solutions  are  a  subset  of  the  set  of  valid  solutions  of  problem  2.1. 

Problems  like  (5.3)  are  well  suited  to  solution  by  scalable  concurrent  iterations 
similar  to  (3.4).  Simulations  by  this  author  using  iterative  solutions  of  a  finite  differ¬ 
ence  formulation  of  (5.3)  have  shown  that  overall  solution  times  to  fine  Aw  are  similar 
to  times  for  algorithm  DIFFUSION. 

Transfers  must  occur  as  a  second  step  of  the  algorithm.  This  means  that  a  load 
balancing  method  based  on  this  calculation  would  have  a  serial  dependency.  This 
dependency  could  be  expensive  for  large  disturbances.  Diffusion  methods  do  not 
suffer  from  this  problem  because  they  transfer  work  over  many  steps  concurrently 
with  the  calculation  of  u' . 


5.3  A  multilevel  method 

Multigrid  methods  [40]  are  a  popular  way  to  accelerate  convergence  of  iterative  so¬ 
lution  methods  for  linear  systems  of  equations.  Horton  [25]  suggests  that  a  similar 
approach  can  accelerate  the  convergence  of  diffusion  methods. 

The  article  presents  a  “multilevel”  algorithm  for  load  balancing  which  has  loga¬ 
rithmic  elapsed  time  complexity.  This  algorithm  requires  that  the  aggregate  workload 
among  a  subset  of  computers  is  known  at  each  step.  Although  this  is  certainly  fea¬ 
sible  it  suggests  the  use  of  an  algorithm  similar  to  SIMPLE  and  SIMPLE2  of  the 
second  chapter.  This  suggests  that  this  algorithm  may  not  be  particularly  efficient  in 
comparison  to  diffusion  methods. 

The  inner  loop  of  a  “basic  diffusion  model”  is  presented  as 

for  all  neighbors  j  of  tj>i  do 

transfer  (iq  —  uj)/2  units  of  work  from  if>i  to  ij)j 


5.4.  A  DISTRIBUTED  TASK  POOL 
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In  one  dimension  this  transaction  is  equivalent  to 

=  «f°  -  \  (<*f°  -  «ti)  -  \  («f°  - «'!”!) 

This  is  algorithm  AVERAGE  which  was  shown  in  the  second  chapter  to  be  incorrect 
in  the  absence  of  periodic  boundary  conditions.  On  a  hypercube  architecture  this 
algorithm  is  identical  to  Cybenko’s  dimension  exchange. 


5.4  A  distributed  task  pool 

The  task  pool  algorithm  of  Hofstee  et  al  [22]  is  scalable  and  correct.  This  algorithm  is 
concerned  with  distributing  a  pool  of  tasks  to  a  set  of  computers  in  a  way  that  ensures 
load  balance.  Processes  are  assumed  to  be  independent  and  no  assumptions  are  made 
regarding  the  order  of  execution  or  adjacency  relationship  among  the  processes.  An 
example  of  an  application  for  which  this  algorithm  could  be  beneficial  might  be  an 
online  transaction  processing  system  in  which  a  large  number  of  tasks  (transactions) 
queue  in  a  task  pool  until  they  can  be  serviced  by  a  multicomputer.  The  algorithm 
is  not  intended  for  domain  decomposed  calculations  in  which  tasks  must  execute 
concurrently  and  in  which  the  adjacency  constraint  (2.1.c)  must  be  observed.  Because 
of  this  the  algorithm  appears  to  be  less  useful  for  grid  computations. 
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